happi: a hierarchical approach to pangenomics inference

Recovering metagenome-assembled genomes (MAGs) from shotgun sequencing data is an increasingly common task in microbiome studies, as MAGs provide deeper insight into the functional potential of both culturable and non-culturable microorganisms. However, metagenome-assembled genomes vary in quality and may contain omissions and contamination. These errors present challenges for detecting genes and comparing gene enrichment across sample types. To address this, we propose happi, an approach to testing hypotheses about gene enrichment that accounts for genome quality. We illustrate the advantages of happi over existing approaches using published Saccharibacteria MAGs, Streptococcus thermophilus MAGs, and via simulation. Supplementary Information The online version contains supplementary material available at 10.1186/s13059-023-03040-6.

3. for b = 1, ..., B do (a) Sample π * from the set of permutations on n elements.Define X * b = π * (X), that is, X * b is a matrix of the same dimension as X = {X 1 , X 2 , . . ., X n }, but with the columns reordered according to the permutation π * .

S2 Sensitivity analysis using Streptococcus thermophilus MAGs
To assess the robustness of our results to different assumptions about the level of genome contamination, we conducted a sensitivity analysis of our results to different choices of ε, the probability of observing a gene given that it is truly absent.Our original choice of ε = 0.05 was motivated by the maximum contamination of genomes in our sample being approximately 5%, and for our original analysis we found 219 differentially present genes at the 5% FDR level.We reran our analysis for ε ∈ {0.01, 0.1}.Increasing ε to 0.1 resulted in 187 differentially present genes, while decreasing Fig. S1: Motivated by our S. thermophilus dataset with n = 157, we evaluate the Type 1 error rate control of happi-a for large sample sizes and varying degrees of correlation between X i and M i .happi-a is able to control the Type 1 error rate, behaving near-exactly at larger sample sizes with a rejection rate for a 5% test ranging from 4.9% (n = 150 and σ x = 0.25; 95% CI: 3.1-6.7%)to 6.8% (n = 125 and σ x = 0.5; 95% CI:: 4.6-8.9%).In contrast, GLM-Rao exhibits anti-conservativeness at larger sample sizes and all levels of correlation with rejections rates ranging from 20.6% (n = 125 and σ x = 0.5; 95% CI:: 17.1-24.0%)to 38.3% (n = 150 and σ x = 0.25; 95% CI: 34.2-42.4%).We show results for 500 simulations.Because happi-np is computationally expensive for large sample sizes, it was not feasible to run it for 500 simulations, and therefore we do not show its performance here.
ε to 0.01 resulted in 252 differentially present genes.The smaller number of p-values identified as differentially present by happi for larger values of ε is to be expected because differences in gene presence between groups are more attributable to erroneously gene observations.We compare differences in − log 10 (p-values) for results when using ε = 0.05 compared to − log 10 (p-values) when decreasing ε = 0.01 and increasing ε = 0.1 in Additional file 1: Fig. S3.Decreasing ε = 0.01 resulted in smaller p-values compared to p-values when using ε = 0.05.Conversely, increasing ε = 0.1 generally increased p-values in comparison to p-values when using ε = 0.05.

S3 Model misspecification simulation study
Here we investigate the Type 1 error rate of happi under model misspecification.We specifically focus on the misspecification of P r(Y i = 1|λ i = 1, M i ) = f (M i ) as a non-decreasing function in M i .This is motivated by the concern that in extremely deep short-read sequencing studies, short read assembly tools may fail to assemble fragments of a genome.Thus, the probability of detecting a gene given that it is present may be smaller for high coverages M i than for moderate coverages.Because obtaining such high coverages for metagenome-assembled genomes is unusual (because it requires extremely deep sequencing or very simple communities), happi estimates model parameters under the assumption that f (M i ) is non-decreasing.To investigate the robustness of happi to model misspecification in the monotonicity of f (M i ), we generate data with an "inverse-U" relationship between coverage and gene detection, and evaluate happi's error rate control.We We consider q = 1 and q = 2, X i1 = 1, X i2 = N ( i−1 n−1 , σ = σ x ), ε = 0 and β = (0, 0) T .We evaluate three intervals for the generation of our mean coverages M i : M i = 10 + 90 i−1 n−1 , M i = 10 + 190 i−1 n−1 , and M i = 10 + 290 i−1 n−1 .By evaluating these three intervals of M i , we can investigate the effect of increasingly large coverages on happi's error rate control.We simulate data according to the model described in (Eq. 1) and (Eq.2), and conduct 500 simulations.We ran 500 permutations for happi-np.As previously, GLM-LRT and GLM-Rao produced highly similar p-values, and therefore we only show results for GLM-Rao.
The results of our Type 1 error rate simulations are shown in Additional file 1: Fig. S2.In these simulations, all models are misspecified: happi is misspecified because of the monotonicity assumption, and GLM-Rao fails to account for differential quality in the genomes.We see that all methods have worse error rate control as the range of coverages increases (top to bottom), and the correlation between X i and M i increases (right to left).For low correlation between X i and M i (σ x = 0.5), happi-np and GLM-Rao both appear to control error rates at coverages up to 200X.Both methods have increased error rates as the correlation between X i and M i increases (σ x = 0.25).All methods suffer when the coverage ranges from 10-300X, which is not surprising as the true probability of detecting a present gene drops to below 30% at M i = 300 (Additional file 1: Fig. S4).
Across 6×500 = 3000 simulations, the average Type 1 error rate at the 5% level for each method was 13.1% for happi-a, 10.5% for happi-np and 8.2% for GLM-Rao.Thus, no method's hypothesis tests are robust to model misspecification.Based on our simulation results, happi fails to control the Type 1 error rate in more scenarios than GLM-Rao, suggesting that the happi is less robust to non-monotonic relationships between quality variables and detection compared to GLM-based approaches.Based on these results, we do not recommend the use of happi when analyzing MAGs reconstructed from extremely deeply sequenced short-read data, such as when coverages range up to 300X.However, we also note that GLM approaches may not be reliable in these settings either.Fig. S2: We evaluate the Type 1 error rate control of happi-a, happi-np, and GLM-Rao under model misspecification (a non-monotonic relationship between coverage and gene detection).We do not recommend the use of happi when analyzing MAGs reconstructed from extremely deeply sequenced short-read data, such as when coverages range up to 300X.However, GLM-Rao may also not be reliable in this setting.We compare the β 1 estimates from a model that uses mean genome-level coverage as the quality variable (x-axis) to the β 1 estimates from a model that uses genome completion as the quality variable (y-axis).The β 1 estimates are strongly correlated between the models, suggesting that our results are robust to our choice of M i .Data from Shaiber et al. 2020 [18]; see section "Data analysis: Saccharibacteria MAGs" for details.
construct a non-monotonic f (•) by fitting a generalized additive model with binomial link function to the observations shown inFritz et al. 2019 [22]  (Figure3, top left panel, Megahit 1.0.3,2% error rate curve).The fitted f (M i ) curve that we use for simulation is shown in Additional file 1: Fig.S4.
Fig.S3: We investigate the robustness of p-values obtained from running happi with different levels of contamination parameter ε.We compare happi with ε = 0.05 to happi with ε = 0.01 and ε = 0.1.We see that the p-values remain highly correlated across varying values of ε, and the magnitude of very small p-values (top right corner of plots) remains consistent across ε levels.When we decrease ε = 0.01 from ε = 0.05, the p-values tend to become smaller, while when we increase ε = 0.1 from ε = 0.05 the p-values tend to become larger.We discourage further exploration of genes whose significantly differential presence hinges on the assumption of low genome contamination levels, and is not robust across small increases in ε.Data fromRichardson et al. 2023 [19].

Fig. S4 :
Fig. S4: We estimated the relationship between coverage and the probability of gene detection for high-coverage short read metagenome sequencing studies in order to simulate data with a "inverse-U" relationship between quality variables and detection.Observations from Fritz et al. 2019 [22] (Figure 3, top left panel, 2% error rate curve using Megahit 1.0.3)were smoothed to construct an f (M i ) function.

Fig. S5 :
Fig.S5: Different quality variables M i can be used with happi.We compare the β 1 estimates from a model that uses mean genome-level coverage as the quality variable (x-axis) to the β 1 estimates from a model that uses genome completion as the quality variable (y-axis).The β 1 estimates are strongly correlated between the models, suggesting that our results are robust to our choice of M i .Data from Shaiber et al. 2020[18]; see section "Data analysis: Saccharibacteria MAGs" for details.